Recommender Systems

Recommender Systems are a set of methods able to predict the 'rating' or 'preference' that a user would give to an item. Among the different approaches to design this kind of systems, in this lab session we are going to work with Collaborative Filtering (CF) approaches, in particular, we will implement a user based CF model and the Alternative Least Square (ALS) approach.

Along the notebook we are going to use the dataset from MovieLens. MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. The original version of this problem contains 10 millions of ratings applied to 10681 movies by 71567 users. However, for this lab, we will use a reduced version consisting of 100,000 ratings (with values from 1 to 5) from 943 users on 1682 movies, where each user has rated, at least, 20 movies.

As you progress in this notebook, you will have to complete some exercises. Each exercise includes an explanation of what is expected, followed by code cells where one or several lines will contain <FILL IN>. The cell that needs to be modified will have # TODO: Replace <FILL IN> with appropriate code on its first line. Once the <FILL IN> sections are updated, the code can be run; below this cell, you will find the test cell (beginning with the line # TEST CELL) and you can run it to verify the correctness of your solution.

Read data and preprocessing


In [1]:
# Import some libraries
import numpy as np
import math
from test_helper import Test
from scipy import sparse

In [2]:
# Define data file
ratingsFilename = 'u.data'

f = open(ratingsFilename)
rawRatings  = f.readlines()

# Check file format
print rawRatings[:10]


['196\t242\t3\t881250949\n', '186\t302\t3\t891717742\n', '22\t377\t1\t878887116\n', '244\t51\t2\t880606923\n', '166\t346\t1\t886397596\n', '298\t474\t4\t884182806\n', '115\t265\t2\t881171488\n', '253\t465\t5\t891628467\n', '305\t451\t3\t886324817\n', '6\t86\t3\t883603013\n']

Formatting the data

As you can checked, each line is formatted as: UserID \t MovieID \t Rating \t Timestamp \n

So, let's convert each line to a list with the fields [UserID, MovieID, Rating] (we drop the timestamp because we do not need it for this exercise). Note that the result of this conversion will provide a list of lists.

Tip: Check the Python function split() to convert each line into a list of items splitted by a given character.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

formatRatings = # FILL IN

print formatRatings[:10]

In [4]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(formatRatings[0], ['196', '242', '3'], 'incorrect result: data are incorrectly formatted')


1 test passed.

Create rating matrix

Now, let's convert the above list of list into a data matrix. Due to the fact that each user has only rated a few items, we will have a lot of missing positions in this matrix. So, to store this information efficiently, we should use a sparse matrix format.

Scipy library includes a class to create different sparse matrices sparse matrices. Given the data format that we are working, we will start creating a COOrdinate format (coo) matrix. This matrix can be created with the following command:

 coo_matrix((data, (i, j)), [shape=(M, N)]) 

where data contains the entries of the matrix (our ratings), i is the row indexes of the matrix entries (the user ids who have generated the ratings) and j the column indexes of the matrix entries (the item ids that have been rated). Furthermore, it is advisable to indicate the matrix shape, then, values M and N indicates, respectively, the number of rows and columns (number of users and items) of the matrix.

Complete the cell below, following the indications, to construct the sparse matrix.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# 1. Extract from the formatted data the list of users ids, item ids and the ratings. 
# Check zip(*) function to unzip the formatRatings variable
[user, item, rating] = # <FILL IN>

# 2. Convert the elements of each list to integer values and build an numpy array with the resulting list
# Due to user, movie and ratign are python lists, you will have to apply the int() operator element by element,
# check the list comprehension syntaxis to speed up this
user = # <FILL IN>
item = # <FILL IN>
rating = # <FILL IN>

# 3. Calculate the number of users, item and ratings
number_users = # <FILL IN>
number_items = # <FILL IN>
number_ratings = # <FILL IN>
print number_users 
print number_items
print number_ratings

# 4. Build the coo matrix (take into account that user and item ids start from 1 and python indexing starts in zero)
rating_matrix = coo_matrix((rating, (user-1,item-1)), shape =(number_users, number_movies)).tocsr()

In [6]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals((np.sum(user), np.sum(item) , np.sum(rating)), (46248475, 42553013, 352986), 'incorrect result: user, item or rating is incorrect')
Test.assertEquals(np.round(rating_matrix.mean(),2), 0.22, 'incorrect result: user, item or rating is incorrect')


1 test passed.
1 test passed.

Finally, let's transform the rating matrix into a Compressed Sparse Row (CSR) format, since we will have to acces to the users values (read rows), so this format facilitates this type of access.


In [7]:
rating_matrix = rating_matrix.tocsr()

Data visualization

Next cell lets you visualize the values of (a reduced version of) the matrix as an image where each rating is depicted as a little square. Blue represents the absence of a rating and the yellow-red levels represent the rating value. The plot confirms that the matrix is sparse (most of the squares are blue).

You can adapt the code to visualize a larger fraction of the matrix (although, you should take care of the matrix size that you consider and your memory space).


In [8]:
%matplotlib inline  
from matplotlib import pyplot as plt

imagedata = rating_matrix[:200, :200].toarray()
# Review the method .toarray(), we will use it later

plt.imshow(imagedata, interpolation='nearest')


Out[8]:
<matplotlib.image.AxesImage at 0xaeed5f8c>

The above image also shows that:

  • some users rate a lot more movies than others;
  • some movies are the target of many more ratings than others. To analyze this behavior in detail, complete the next cell to plot the histogram of the number of items rated by each users and the number of ratings received by each item.

Review the method .getnnz([axis]) of crs matrices to compute the number of stored values (nonzeros) by rows (axis=1) or columns (axis=0).


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# 1. Compute the number items of rated by each user
n_rat_user = # <FILL IN>
plt.figure(1, figsize=(4, 4))
hist_user = plt.hist(n_rat_user, histtype='bar', rwidth=0.8)

# 2. Compute the number ratings for each item
n_rat_item = # <FILL IN>
plt.figure(2, figsize=(4, 4))
hist_item = plt.hist(n_rat_item, histtype='bar', rwidth=0.8)

In [10]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(hist_user[0][0], 560, 'incorrect result: histogram over the number of user ratings is incorrect')
Test.assertEquals(hist_item[0][0], 1146, 'incorrect result: histogram over the number of ratings by item is incorrect')


1 test passed.
1 test passed.

Creating training and test rating matrices

Now, to be able to train and evaluate the different methods, let's divide the rating matrix into two different matrix:

  • one of them with the 75% of the ratings for training the different recommenders;
  • other one, with the remaining 25%, for testing purposes.

Hint: you may need the following functions:


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# From the reduced rating matrix extract the users, items and ratings. Use the sparse.find() method
[users, items, ratings] = #<FILL IN>

# Get the number of ratings
number_ratings = #<FILL IN>

# Compute the number of training ratings as the 75% of the total number of ratings
np.random.seed(0)
n_tr = #<FILL IN>

# Create a permuted range from 0 to the number of ratings
random_pos = #<FILL IN>

# Select the first n_tr positions of random_pos as the training positions, 
# and the remaining random_pos indexes and the positions of the testing ratings
pos_tr = #<FILL IN>
pos_test = #<FILL IN>

# Build the training and testing rating matrices
# Create a coo_matrix and, then, convert it to csr format
# Note that the number of users and items has not changed
rating_train = #<FILL IN>
rating_test = #<FILL IN>

In [12]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(n_tr, 75000, 'incorrect result: n_tr value is incorrect')
Test.assertEquals(np.sum(random_pos[:5]), 142525, 'incorrect result: random_pos values are incorrect')
Test.assertEquals(np.round(rating_train.mean(),2), 0.17, 'incorrect result: the values of the training rating matrix are incorrect')
Test.assertEquals(np.round(rating_test.mean(),2), 0.06, 'incorrect result: the values of the testing rating matrix are incorrect')


1 test passed.
1 test passed.
1 test passed.
1 test passed.

To speed up the evaluations of the recommender systems, let's create a reduced set with 50 testing users. Of course, you can later extend the method evaluation to all the users, although the evaluation of some methods can take several minutes.


In [13]:
np.random.seed(0)
all_users = np.random.permutation(number_users)
test_users = all_users[:50]

Baseline recommenders

In this section we are going to build two baseline approaches. In particular, let's consider:

  1. Mean based baseline which predicts new ratings as the average value of the ratings given by this user to previous rated items.
  2. Mode based baseline, similar the mean baseline, but using the most frequent value (mode) among the known ratings.

So, complete the following cells to implement the functions corresponding to these two approaches.

Hint: When you are working with a csr matrix R, you can get the i-th row of the matrix as R[i,:]. And, if you need their non-zero values, you can use its attribute .data to extract them; i.e., R[i,:].data will return the non-zero values of the i-th of a crs matrix.

1. Mean based baseline


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

def mean_recomender(user_id, item_ids, ratings):
    """ Calculate the output of a mean based recommender
    Args:
        user_id: id of the user to compute its output
        item_ids: numpy array with the ids of the items whose rating has to be estimated
        ratings: crs matrix with the ratings of all the users to the rated items
    Returns:
        pred_items: numpy array of dimensions 1 x number of items with the computed predictions for each item. 
        This prediction is computed as the mean value of the items which the user_id has already rated.
    """
    # Compute the number of items in the rating matrix
    n_items = #<FILL IN>
    # Select the values of the items rated by the user_id
    ratings_u = #<FILL IN>
    # Compute the mean value of the ratings_u
    if ratings_u.shape[0]>0:
        mean_rating = #<FILL IN>
    else: # Cold start problem (average rating)
        mean_rating = 3
    # Generate a numpy vector of dimensions 1 x n_items with all their values equal to mean_rating 
    pred_items = #<FILL IN>
    return pred_items

In [15]:
# Testing function mean_recomender()
user_id = 20
item_ids = np.array([8, 0, 100])
pred_mean = mean_recomender(user_id, item_ids, rating_train)
print pred_mean


[[ 2.768  2.768  2.768]]

In [16]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(pred_mean.shape, (1,3), 'incorrect result: pred_mean shape is incorrect')
Test.assertEquals(np.round(np.sum(pred_mean),2), 8.3, 'incorrect result: pred_mean values are incorrect')


1 test passed.
1 test passed.

2. Mode based baseline

Hint: You can compute the mode of a vector or matrix x over their rows with the command: node_x = mode(x, axis=0)[0]


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

def mode_recomender(user_id, item_ids, ratings):
    """ Calculate the output of a mode based recommender
    Args:
        user_id: id of the user to compute its output
        item_ids: numpy array with the ids of the items whose rating has to be estimated
        ratings: crs matrix with the ratings of all the users to the rated items
    Returns:
        pred_items: numpy array of dimensions 1 x number of items with the computed predictions for each item. 
        This prediction is computed as the mode value of the items which the user_id has already rated.
    """
    # Compute the number of items in the rating matrix
    n_items = #<FILL IN>
    # Select the values of the items rated by the user_id
    ratings_u = #<FILL IN>
    # Compute the mean value of the ratings_u
    if ratings_u.shape[0]>0:
        mode_rating = #<FILL IN>
    else: # Cold start problem (average rating)
        mode_rating = 3
    # Generate a numpy vector of dimensions 1 x n_items with all their values equal to mean_rating 
    pred_items = #<FILL IN>
    return pred_items

In [18]:
# Testing function mode_recomender()
user_id = 20
item_ids = np.array([8, 0, 100])
pred_mode = mode_recomender(user_id, item_ids, rating_train)
print pred_mode


[[ 3.  3.  3.]]

In [19]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(pred_mode.shape, (1,3), 'incorrect result: pred_mean shape is incorrect')
Test.assertEquals(np.round(np.sum(pred_mode),2), 9, 'incorrect result: pred_mean values are incorrect')


1 test passed.
1 test passed.

Performance evaluation

To evaluate the performance of the different recommenders, we are going to use two measurements:

  • The Mean Average Error: $$MAE = \frac{1}{N} \sum_{i=1}^N |p_i -r_i|$$
  • The Root Mean Square Error: $$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^N (p_i -r_i)^2}$$

Thus, complete the following cells two build two functions able to compute the MAE and RMSE values.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

def get_MAE(pred_rating, real_rating):
    """ Calculate the MAE
    Args:
        pred_rating: crs matrix, with dimensions n_users x n_items, with the predicted ratings.
        real_rating: crs matrix, with dimensions n_users x n_items, with the real ratings. 
    Returns:
        MAE: Mean Absolute Error computed over the non-zero entries of real_rating. 
    """
    
    # Extract the non-zero positions of real_rating and their values (use sparse.find() method)
    [pos_users, pos_items, real_values] = # <FILL IN>
    # Extract the predicted values of the non-zero positions
    pred_values = # <FILL IN>
    # Compute the MAE (check np.absolute method)
    MAE = # <FILL IN>
    return MAE

In [21]:
###########################################################
# TEST CELL
###########################################################

matrix_1 = sparse.eye(10).tocsr()
matrix_2 = (1.2*sparse.eye(10)).tocsr()
matrix_2[0,0]= 0.4
Test.assertEquals(np.round(get_MAE(matrix_1, matrix_2),2), 0.24, 'incorrect result: MAE value is incorrect')


1 test passed.

In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

def get_RMSE(pred_rating, real_rating):
    """ Calculate the RMSE
    Args:
        pred_rating: crs matrix, with dimensions n_users x n_items, with the predicted ratings.
        real_rating: crs matrix, with dimensions n_users x n_items, with the real ratings. 
    Returns:
        RMSE: Root Mean Square Error computed over the non-zero entries of real_rating. 
    """
    
    # Extract the non-zero positions of real_rating and their values (use sparse.find() method)
    [pos_users, pos_items, real_values] = # <FILL IN>
    # Extract the predicted values of the non-zero positions
    pred_values = # <FILL IN>
    # Compute the RMSE (check np.sqrt and np.square methods)
    RMSE = # <FILL IN>
    return RMSE

In [23]:
###########################################################
# TEST CELL
###########################################################

matrix_1 = sparse.eye(10).tocsr()
matrix_2 = (1.2*sparse.eye(10)).tocsr()
matrix_2[0,0]= 0.4
Test.assertEquals(np.round(get_RMSE(matrix_1, matrix_2),2), 0.27, 'incorrect result: RMSE value is incorrect')


1 test passed.

Now, let's evaluate the performance of the mean and mode based baselines.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# Compute the number of users and items
n_users, n_items = # <FILL IN>
# Create two empty prediction matrix in crs format 
pred_mean_ratings = sparse.lil_matrix((n_users, n_items))
pred_mode_ratings = sparse.lil_matrix((n_users, n_items))

# Work user by user
for u in range(n_users):
    # Get, form the test matrix, the item id to be predicted for this user (check .indices attribute of crs matrix)
    item_ids = # <FILL IN>
    # Get predictions with the mean based baseline for user u
    pred_mean_u = # <FILL IN>
    # Get predictions with the mode based baseline for user u
    pred_mode_u = # <FILL IN>
    # Build the prediction matrices
    pred_mean_ratings[u,item_ids] = # <FILL IN>
    pred_mode_ratings[u,item_ids] = # <FILL IN>
    
#Compute the error (MAE and RMSE) for each baseline method over the test_users
MAE_mean = # <FILL IN>
RMSE_mean = # <FILL IN>

MAE_mode = # <FILL IN>
RMSE_mode = # <FILL IN>

print 'Mean model ... MAE: %2.2f , RMSE: %2.2f ' % (MAE_mean, RMSE_mean)
print 'Mode model ... MAE: %2.2f , RMSE: %2.2f ' % (MAE_mode, RMSE_mode)

In [25]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(np.round(MAE_mean,2), 0.84, 'incorrect result: MAE value of mean recommeder is incorrect')
Test.assertEquals(np.round(RMSE_mean,2), 1.04, 'incorrect result: RMSE value of mean recommeder is incorrect')
Test.assertEquals(np.round(MAE_mode,2), 0.86, 'incorrect result: MAE value of mode recommeder is incorrect')
Test.assertEquals(np.round(RMSE_mode,2), 1.19, 'incorrect result: RMSE value of mode recommeder is incorrect')


1 test passed.
1 test passed.
1 test passed.
1 test passed.

User based recommendations

In this section let's build a user-based collaborative filter by finding users who are similar to each other and user their rated items to predict the ratings over item which haven't been rated by other users.

The general algorithm will be (in pseudocode) as follows:

  1. For each user, rank every other user in terms of closeness. For this step, we will use the rating matrix and use correlation as the measure of closeness.
  2. Then, to estimate the ratings of unrated items for a given user, we will average the ratings of the neighbors of this user (using the correlation measurement computed in Step 1).

Step 1: Finding similar users

The similarity between different users is determined by comparing each user with every other user and computing a similarity score. Here, we are going to use the Pearson correlation coefficient: $$ sim(user_a, user_b) = \frac{\sum_{p \in P} (r_{a,p} -\bar{r}_a)(r_{b,p} -\bar{r}_b)} {\sqrt{ \sum_{p \in P} (r_{a,p} -\bar{r}_a)^2} \sqrt{ \sum_{p \in P} (r_{b,p} -\bar{r}_b)^2}}$$

where $P$ is set of items rated for both users a and b, $r_{u,p}$ is the rating of the user u to item p, and $\bar{r}_u$ is the mean value of the all the ratings of the user u.

Complete the next code so that it is able to obtain the correlation coefficient between to given users.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

def compute_Pearson_correlation(ratings, id_u1, id_u2):
    """ Calculate correlation coefficient
    Args:
        ratings: crs matrix, with dimensions n_users x n_items, with the ratings used to measure similarities.
        id_u1: id user 1
        id_u2: id user 2   
    Returns:
        corr_value: correlation coefficient
    """
    # Get the indexes and values of the items rated by user 1 (use sparse.find() function)
    [pos_u1, items_u1, values_u1] = # <FILL IN>
    # Get the indexes and values of the items rated by user 2 (use sparse.find() function)
    [pos_u2, items_u2, values_u2] = # <FILL IN>
    # Get the set of items rated by both users (you can use np.intersect1d() method)
    items_intersect = # <FILL IN>
    
    if items_intersect is not None: # If the are common rated items...
        # Compute the mean values of all the items rated by user 1 and user 2
        m_1 = # <FILL IN>
        m_2 = # <FILL IN>
        # Get the ratings of users 1 and 2 in items_intersect (you can use .toarray() method)
        r_u1 = # <FILL IN>
        r_u2 = # <FILL IN>
        # Remove their means
        r_u1 = # <FILL IN>
        r_u2 = # <FILL IN>
        # Compute the correlation coefficient
        corr_value = # <FILL IN>
        # Remove useless dimensions
        corr_value =np.squeeze(corr_value)
        
    else: # Else correlation is 0
        corr_value = 0
        
    # Checking that the correlation is not NaN (this would happen if the denominatior is 0), 
    # in this case, set the corrlation coefficient to 0 
    if math.isnan(corr_value): 
        corr_value = 0
        
    return corr_value

In [43]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(np.round(compute_Pearson_correlation(rating_train, 5, 12),2), 0.36, 'incorrect result: correlation value is incorrect')


1 test passed.

Step 2: Recommending items

Once you know how a user is similar to other users, you would now like to know the movies that should be recommended for this user. So, let's assign a rating to each item by averaging the ratings that the similar users have given to that item according to this expression: $$ pred(user_a, item_i) = \bar{r_a} + \frac{\sum_{b \in N} sim(user_a, user_b) * (r_{b,i}- \bar{r_b})}{\sum_{b \in N} sim(user_a, user_b)}$$ where N is the number of neighbors of user a ($sim >0$) which have rated item i.

Thus, let's create a function, that using the correlation coefficient as similarity value, compute for a given user the ratings over some unseen items.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

def user_sim_recommender(user_id, item_ids, ratings):
    """ Compute the recomendations for user_id over the item_ids with a user based collaborative filtering approach
    Args:
        user_id: id of the user to compute its output
        item_ids: numpy array with the ids of the items whose rating has to be estimated
        ratings: crs matrix with the ratings of all the users to the rated items
    Returns:
        pred_items: numpy array of dimensions 1 x number of items with the computed predictions for each item. 
    """
        
    # Get number of users 
    n_users = #<FILL IN>
    # Get the number of items in items_id
    n_items = #<FILL IN>
    
    # Create variables to save incremental versions of numerator and denominator 
    rating_w_acc = np.zeros((1,n_items))  # Numerator (for each item there is a value)
    sim_acc = 0 # Denominator
    
    # Build a reduced matrix of ratings with only the columns corresponding to item_ids
    ratings_items = #<FILL IN>
    
    # Now we move user by user and compute the corresponding term of the numerator and denominator
    for id_u in range(n_users):
        # Compute the similarity of user_id with id_u
        sim = #<FILL IN>
        # If there is similarity ...
        if sim>0:
            # Get items rated by id_u, among item_ids, and their values 
            # (use sparse.find() over the row id_u of ratings_items )
            [idx_users, pos_ratings_u, ratings_u] =  #<FILL IN>
            # If id_u has rated items among items_id ...
            if pos_ratings_u.shape[0]>0:
                # Get the mean value of all the items rated by id_u
                mean_id_u = #<FILL IN>
                # Update numerator (add term sim*(ratings_u-mean_id_u))
                rating_w_acc[:,pos_ratings_u] = rating_w_acc[:,pos_ratings_u] + #<FILL IN>
                # Update denominator (add sim)
                sim_acc = sim_acc + #<FILL IN>
    
    # Now, that all the terms of numerator and denominator are computed, calculate the predicted values
    # 1. Get the mean value of all the items rated by user_id
    mean_id_user = #<FILL IN>
    # 2. Get predictions
    # If this user has similar users (sim_acc>0)...
    if sim_acc >0:
        # Get predictions with general expresion 
        pred_items = #<FILL IN>
    else: # else (cold start problem)...
        # Give predictions as mean value (mean_id_user)
        pred_items = #<FILL IN>
    return pred_items

In [45]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(np.round(np.sum(user_sim_recommender(20, np.array([2, 5, 8]), rating_train))), 9, 'incorrect result: correlation value is incorrect')


1 test passed.

Now, let's evaluate the performance of this recommender over all the users


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

print 'Please, be patient, this computation takes a while... '

# Compute the number of users and items
n_users, n_items = # <FILL IN>
# Create an empty prediction matrix in crs format 
pred_ratings = sparse.lil_matrix((n_users, n_items))

# Work user by user
for u in test_users:
    # Get, form the test matrix, the item id to be predicted for this user (check .indices attribute of crs matrix)
    item_ids = # <FILL IN>
    # Get predictions with the used based CF method for user u
    pred_u = # <FILL IN>
    # Build the prediction matrix
    pred_ratings[u,item_ids] = # <FILL IN>
    
# Compute the error (MAE and RMSE) over test_users
MAE = # <FILL IN>
RMSE = # <FILL IN>

print 'MAE: %2.2f , RMSE: %2.2f ' %(MAE, RMSE)

In [47]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(np.round(MAE,2), 0.82, 'incorrect result: MAE value is incorrect')
Test.assertEquals(np.round(RMSE,2), 1.02, 'incorrect result: RMSE value is incorrect')


1 test passed.
1 test passed.

Advance work: you can try to modify the function compute_Pearson_correlation() to force than a minimum number of common items have been rated (for instance, more than 4 items) to consider that two users can be similar. Check again the MAE and RMSE, have they been reduced? Is the method scalability improved?

Recommendation analysis

Once we have design a recommender system, we can analyze the movies which should be recommended for a given user. For this purpose, let's start loading the movies information and, then, analyze which movies should be recommended.

In the file 'u.item' we have the information associated to each item. So, let's start loading this file...


In [48]:
moviesFilename = 'u.item'
f = open(moviesFilename)
rawMovies = f.readlines()
print rawMovies[:5]


['1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0\n', '2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0\n', '3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0\n', '4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0\n', '5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0\n']

Each line in the dataset is formatted as:

>>   MovieID | Title | Date | url | Genres codification\n

Now, let's format each line to create a list of [MovieID, Title]. The remaining variables will be dropped because we do not need them for this exercise.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

formatMovies = #<FILL IN>
print formatMovies[:5]

In [50]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals((len(formatMovies), len(formatMovies[0])), (1682,2), 'incorrect result: formatMovies dimensions are incorrect')
Test.assertEquals(formatMovies[10], ['11', 'Seven (Se7en) (1995)'], 'incorrect result: formatMovies content is incorrect')


1 test passed.
1 test passed.

Once we have the movie information, let's analyze the set of ten movies that we would recommend to the user with 20. Complete the following cell, following the instructions...


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# 1. Compute the predictions, over all the items, that the user based system would provide for the user with id 20
user_id = 20
# Define the list of items as a list with all the item ids
item_ids = #<FILL IN>
# Get the predicitions (use user_sim_recommender() function)
list_pred = #<FILL IN>
# Remove useless dimensions of list_pred
list_pred = np.squeeze(list_pred)

# 2. Sort the list of predicted ratings, placing the highest ratings at the first
pos_ord = #<FILL IN>

# 3. Print the film titles with the ten highest ratings 
for i in range(10):
    # Get the id of the movie sorted at position i
    id_movie = #<FILL IN>
    print '%d: %s with rating %2.2f' %(i+1, formatMovies[id_movie][1], list_pred[id_movie])

In [58]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(np.round(sum(list_pred[:10]),2), 37.63, 'incorrect result: list_pred is incorrect')
Test.assertEquals(sum(pos_ord[:10]), 1579, 'incorrect result: pos_ord is incorrect')


1 test passed.
1 test passed.

Alternative Least Squares algorithm (optional exercise)

Finally, let's implement the ALS algorithm. As you, know this method tries to approximate the ratings matrix by factorizing it as the product of two matrices:

$$ R = X * Y $$

where $X$ describes properties of each user, and $Y$ describes properties of each item. These two matrices are known as latent factors, since they are a low-dimension representation of users and items.

These matrices are computed such that the error for the users/items pairs over the known ratings is minimized:

$$ \min_{X, Y} \sum_{(u, i) \in P} \left( r_{u,i} - X_u Y_i \right)^2 + \lambda \left( ||X||^2 + ||Y||^2 \right) $$

where P is the set known rated items.

The ALS algorithm does this by first randomly selecting the latent factors matrices $X$ and $Y$ and then, given the values of $Y$, it optimizes the values of $X$ such that the error is minimized:

$$ \min_{X} \sum_{(u, i) \in P} \left( r_{u,i} - X_u Y_i \right)^2 + \lambda ||X||^2 $$

Then, by the computed users' factors ($X$), it optimizes the value of the item's matrix solving: $$ \min_{Y} \sum_{(u, i) \in P} \left( r_{u,i} - X_u Y_i \right)^2 + \lambda ||Y||^2 $$

This alternation between which matrix to optimize is the reason for the "alternating" in the name.

To solve this optimization problem, either for users or items factors, you can use the ridge regression implementation of scikit-learn.


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

from sklearn.linear_model import Ridge

def train_ALS(ratings, lambda_, n_factors):
    """ Compute the latent factors of the ALS algorithm
    Args:
        ratings: crs matrix with the ratings of all the users to the rated items
        lambda_: regularization parameter
        n_factors: number of latent factors
    Returns:
        X, Y: latent factor matrices of users and items 
    """
    
    # Parameters
    n_iterations = 20
    
    # Get the number of users and items
    n_users, n_items = # <FILL IN> 
    
    # Random initialization of latent factors
    np.random.seed(0)
    X = 5 * np.random.rand(n_users, n_factors) 
    Y = 5 * np.random.rand(n_factors, n_items)

    # Define the classifier
    clf = Ridge(alpha=lambda_, fit_intercept=False, max_iter=100,tol=0.01)
    for ii in range(n_iterations):
        for u in range(n_users):
            # From ratings matrix get the rated items by user u (use toarray() method)
            # Use np.squeeze to remove useless dimensions of r_u
            r_u = # <FILL IN> 
            # Let's create an index matrix indicating the positions where there is or there isn't a rating
            w_u = # <FILL IN> 
            # Solve the optimization problem
            # Find X_u to minimize (w_u*(r_u-X[u,:]*Y)^2)
            clf.fit(Y.T, r_u.T,w_u.T)  
            # Get the coefficients computed by the model and add it to the latent factor matrix
            X[u,:] = # <FILL IN> 
        for i in range(n_items):
            # From ratings matrix get the rating corresponding to item i (use toarray() method)
            # Use np.squeeze to remove useless dimensions of r_i
            r_i = # <FILL IN> 
            # Let's create an index matrix indicating the positions where there is or there isn't a rating
            w_i = # <FILL IN> 
            # Solve the optimization problem
            # Find Y_i to minimize (w_i*(r_i-X*Y[i,:])^2) 
            clf.fit(X, r_i,w_i) 
            # Get the coefficients computed by the model and add it to the latent factor matrix
            Y[:,i] = # <FILL IN> 
        
        # To analyze error evolution
        # Get predictions (use np.dot to multiply latent factor matrices)
        pred_ratings = # <FILL IN> 
        # Compute the error (MAE and RMSE) 
        MAE = # <FILL IN> 
        RMSE = # <FILL IN> 
        print 'Iteration: %d, MAE: %2.2f , RMSE: %2.2f ' % (ii, MAE, RMSE)

    return X, Y

In [60]:
# Test the ALS funtion

# parameters
lambda_ = 10
n_factors = 10

# Train the ALS model
X, Y = train_ALS(rating_train, lambda_, n_factors)


Iteration: 0, MAE: 1.41 , RMSE: 1.68 
Iteration: 1, MAE: 0.97 , RMSE: 1.20 
Iteration: 2, MAE: 0.78 , RMSE: 0.98 
Iteration: 3, MAE: 0.71 , RMSE: 0.89 
Iteration: 4, MAE: 0.68 , RMSE: 0.86 
Iteration: 5, MAE: 0.66 , RMSE: 0.84 
Iteration: 6, MAE: 0.66 , RMSE: 0.83 
Iteration: 7, MAE: 0.65 , RMSE: 0.82 
Iteration: 8, MAE: 0.65 , RMSE: 0.82 
Iteration: 9, MAE: 0.65 , RMSE: 0.82 
Iteration: 10, MAE: 0.64 , RMSE: 0.82 
Iteration: 11, MAE: 0.64 , RMSE: 0.81 
Iteration: 12, MAE: 0.64 , RMSE: 0.81 
Iteration: 13, MAE: 0.64 , RMSE: 0.81 
Iteration: 14, MAE: 0.64 , RMSE: 0.81 

In [61]:
###########################################################
# TEST CELL
###########################################################

Test.assertEquals(np.round(np.mean(X),2), 0.13, 'incorrect result: X values are incorrect')
Test.assertEquals(np.round(np.mean(Y),2), 0.08, 'incorrect result: Y values are incorrect')


1 test passed.
1 test passed.

Now, let's compute the error over the test data


In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# Get predictions (use np.dot to multiply latent factor matrices)
pred_ratings = # <FILL IN> 
# Compute the error (MAE and RMSE) over test_users
MAE = # <FILL IN>
RMSE = # <FILL IN>

print 'MAE: %2.2f , RMSE: %2.2f ' % (MAE, RMSE)

In [63]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(MAE,2), 0.78, 'incorrect result: MAE value is incorrect')
Test.assertEquals(np.round(RMSE,2), 1, 'incorrect result: RMSE value is incorrect')


1 test passed.
1 test passed.